Information Retrieval and Extraction from the Web: the CROSSMARC approach

Authors

  • Vangelis Karkaletsis
  • Constantine D. Spyropoulos
Abstract

The paper presents the CROSSMARC approach to the complex task of identifying interesting web sites and web pages and extracting information from them. This task is hard because most of the information on the Web today is in the form of HTML documents, which are designed for presentation rather than for automatic extraction systems. It becomes even harder in a multilingual context, where web pages in different languages need to be considered. The CROSSMARC approach focuses on the easy customization of web information retrieval and extraction technology to new domains and languages. This is achieved by adopting and implementing an open, multilingual, multi-agent architecture that integrates the CROSSMARC components into a web-based prototype system, and by providing an infrastructure that facilitates customization of these components to new domains and languages.


Similar resources

Use of Ontologies for Cross-lingual Information Management in the Web

We present the ontology-based approach for cross-lingual information management of web content that has been developed by the EC-funded project CROSSMARC. CROSSMARC can be perceived as a meta-search engine, which identifies domain-specific information from the Web. To achieve this, it employs agents for web crawling, spidering, information extraction from web pages, data storage, and data present...


Cross-lingual Information Extraction from Web pages: the use of a general-purpose Text Engineering Platform

In this paper we present how the use of a general-purpose text engineering platform has facilitated the development of a cross-lingual information extraction system and its adaptation to new domains and languages. Our approach for cross-lingual information extraction from the Web covers all the way from the identification of Web sites of interest, to the location of the domain-specific Web pages,...


Assessing the Internal Structure of the Ellis Information Retrieval Model in Order to Present the Persian Norm of Web Retrieval Tools

Introduction: This study evaluated the internal structure of the Ellis information-seeking model in the student community, with the aim of presenting the Persian norm. Methods: This is a descriptive-analytical study conducted by a cross-sectional survey method in the second semester of the academic year 1399-1400. The population comprised 280 graduate students at Ahvaz Jundishapur University of Medical Scien...


Behavioral Considerations in Developing Web Information Systems: User-centered Design Agenda

The current paper explores the design of a web information retrieval system with regard to the searching behavior of users in real, everyday life. Designing an information system that is closely linked to human behavior is equally important for providers and end users. From an Information Science point of view, four approaches in designing information retrieval systems were identified as system-...


Multi-lingual XML-Based Named Entity Recognition in Web Pages

We describe the multilingual Named Entity Recognition and Classification (NERC) subpart of an information extraction system, which is currently under development as part of the EU-funded project CROSSMARC. The two main CROSSMARC goals are to develop commercial-strength technologies based on language processing methodologies for information extraction from web pages and to provide automated tech...



Publication date: 2004